Please sign in on etherpad: https://pad.carpentries.org/2020-09-10-06-r2

Episode data types

download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
gapminder <- read.csv("data/gapminder_data.csv")

We can also read in a file directly from the web:

gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv")
str(gapminder) #str stands for structure 
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
View(gapminder) #to look at it in a tab 

Data types

R has a few data types it is good to be aware of:

typeof(gapminder$year)
## [1] "integer"
typeof(gapminder$lifeExp)
## [1] "double"
typeof(3.14)
## [1] "double"
typeof(TRUE)  # logical 
## [1] "logical"
typeof(1L) # The L suffix forces the number to be an integer; by default R stores numbers as doubles
## [1] "integer"
typeof('banana')
## [1] "character"
class(gapminder)
## [1] "data.frame"
typeof(gapminder$continent)
## [1] "character"
typeof(gapminder$country) #character
## [1] "character"
typeof(gapminder$year)
## [1] "integer"

Vectors

x <- c(1, 2.4, 3, 5) #what's the <- again?
x
## [1] 1.0 2.4 3.0 5.0
str(x)
##  num [1:4] 1 2.4 3 5
typeof(x)
## [1] "double"

A couple of things:

  • The c() function is used a lot in R. It stands for combine, and it creates a vector.
  • There are other ways to create a vector, but we use this one a lot.
  • What happens if we create a mixed vector?

y = c("dog", 1.4, 3.5, TRUE)
y
## [1] "dog"  "1.4"  "3.5"  "TRUE"
str(y)
##  chr [1:4] "dog" "1.4" "3.5" "TRUE"
typeof(y)
## [1] "character"
char_vector_nums <- c('1','2','3')
typeof(as.numeric(char_vector_nums))
## [1] "double"
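The mixed vector above ended up as all character. A minimal sketch of the general pattern, using made-up values (a, b, d and nums are names chosen only for this illustration): when types are mixed, R converts everything to the most flexible type present, in the order logical, integer, double, character.

```r
# Coercion in mixed vectors: logical -> integer -> double -> character
a <- c(TRUE, 2L)      # logical + integer
b <- c(2L, 3.5)       # integer + double
d <- c(3.5, "dog")    # double + character

typeof(a)  # "integer"
typeof(b)  # "double"
typeof(d)  # "character"

# Going the other way: strings that are not numbers become NA.
# as.numeric() would normally warn here; the warning is silenced for the demo.
nums <- suppressWarnings(as.numeric(c("1", "2", "dog")))
nums
```

This is why as.numeric() on c('1','2','3') works cleanly, while a vector containing "dog" would not convert without losing that element.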

Data structures: a realistic example

str(gapminder)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
summary(gapminder$country)
##    Length     Class      Mode 
##      1704 character character
?summary
summary(gapminder$country) # tells us the length of the character vector
##    Length     Class      Mode 
##      1704 character character
factor(gapminder$country)
summary(factor(gapminder$country))
##              Afghanistan                  Albania                  Algeria 
##                       12                       12                       12 
##                   Angola                Argentina                Australia 
##                       12                       12                       12 
##                  Austria                  Bahrain               Bangladesh 
##                       12                       12                       12 
##                  Belgium                    Benin                  Bolivia 
##                       12                       12                       12 
##   Bosnia and Herzegovina                 Botswana                   Brazil 
##                       12                       12                       12 
##                 Bulgaria             Burkina Faso                  Burundi 
##                       12                       12                       12 
##                 Cambodia                 Cameroon                   Canada 
##                       12                       12                       12 
## Central African Republic                     Chad                    Chile 
##                       12                       12                       12 
##                    China                 Colombia                  Comoros 
##                       12                       12                       12 
##          Congo Dem. Rep.               Congo Rep.               Costa Rica 
##                       12                       12                       12 
##            Cote d'Ivoire                  Croatia                     Cuba 
##                       12                       12                       12 
##           Czech Republic                  Denmark                 Djibouti 
##                       12                       12                       12 
##       Dominican Republic                  Ecuador                    Egypt 
##                       12                       12                       12 
##              El Salvador        Equatorial Guinea                  Eritrea 
##                       12                       12                       12 
##                 Ethiopia                  Finland                   France 
##                       12                       12                       12 
##                    Gabon                   Gambia                  Germany 
##                       12                       12                       12 
##                    Ghana                   Greece                Guatemala 
##                       12                       12                       12 
##                   Guinea            Guinea-Bissau                    Haiti 
##                       12                       12                       12 
##                 Honduras          Hong Kong China                  Hungary 
##                       12                       12                       12 
##                  Iceland                    India                Indonesia 
##                       12                       12                       12 
##                     Iran                     Iraq                  Ireland 
##                       12                       12                       12 
##                   Israel                    Italy                  Jamaica 
##                       12                       12                       12 
##                    Japan                   Jordan                    Kenya 
##                       12                       12                       12 
##          Korea Dem. Rep.               Korea Rep.                   Kuwait 
##                       12                       12                       12 
##                  Lebanon                  Lesotho                  Liberia 
##                       12                       12                       12 
##                    Libya               Madagascar                   Malawi 
##                       12                       12                       12 
##                 Malaysia                     Mali               Mauritania 
##                       12                       12                       12 
##                Mauritius                   Mexico                 Mongolia 
##                       12                       12                       12 
##               Montenegro                  Morocco               Mozambique 
##                       12                       12                       12 
##                  Myanmar                  Namibia                    Nepal 
##                       12                       12                       12 
##              Netherlands              New Zealand                Nicaragua 
##                       12                       12                       12 
##                    Niger                  Nigeria                   Norway 
##                       12                       12                       12 
##                     Oman                 Pakistan                   Panama 
##                       12                       12                       12 
##                  (Other) 
##                      516
gapminder$countr_fac <- factor(gapminder$country)
str(gapminder)
## 'data.frame':    1704 obs. of  7 variables:
##  $ country   : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year      : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop       : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent : chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp   : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap : num  779 821 853 836 740 ...
##  $ countr_fac: Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
typeof(gapminder$year)
## [1] "integer"
typeof(gapminder$country)
## [1] "character"
str(gapminder$country)
##  chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
str(gapminder$countr_fac)
##  Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
length(gapminder)
## [1] 7
typeof(gapminder)
## [1] "list"
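To see what a factor stores, here is a small sketch on a made-up vector (pets and pets_fac are hypothetical names, not part of gapminder). It suggests why str() printed 1 1 1 ... for countr_fac: a factor keeps integer codes plus a set of levels.

```r
# A factor is stored as integer codes that point into a vector of levels
pets <- c("dog", "cat", "dog", "bird")   # hypothetical example data
pets_fac <- factor(pets)

levels(pets_fac)      # unique values, sorted: "bird" "cat" "dog"
as.integer(pets_fac)  # the codes: 3 2 3 1
typeof(pets_fac)      # "integer"
class(pets_fac)       # "factor"
summary(pets_fac)     # counts per level, like summary(factor(gapminder$country))
```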

Let’s explore some other functions we can use to inspect dataframes:

nrow(gapminder)
## [1] 1704
ncol(gapminder)
## [1] 7
dim(gapminder)
## [1] 1704    7
colnames(gapminder)
## [1] "country"    "year"       "pop"        "continent"  "lifeExp"   
## [6] "gdpPercap"  "countr_fac"
head(gapminder)
##       country year      pop continent lifeExp gdpPercap  countr_fac
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453 Afghanistan
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530 Afghanistan
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007 Afghanistan
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971 Afghanistan
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811 Afghanistan
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134 Afghanistan
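Earlier, length(gapminder) returned the number of columns and typeof(gapminder) returned "list". A rough sketch of why, on a toy data frame (df is a made-up example): a data frame is a list in which every element is one column.

```r
# A data frame is a list of equal-length columns
df <- data.frame(id = 1:3, name = c("a", "b", "c"),
                 stringsAsFactors = FALSE)   # hypothetical toy data

is.list(df)         # TRUE
length(df)          # 2, the number of columns (same as ncol(df))
names(df)           # "id" "name" (same as colnames(df))
sapply(df, typeof)  # the type of each column
```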

Subsetting

life_exp <- gapminder[['lifeExp']]
str(life_exp)
##  num [1:1704] 28.8 30.3 32 34 36.1 ...
life_exp[1]
## [1] 28.801
life_exp[c(3,7)]
## [1] 31.997 39.854
1:4
## [1] 1 2 3 4
life_exp[1:4]
## [1] 28.801 30.332 31.997 34.020
life_exp[-1:4]

Why didn't that work? Because -1:4 expands to -1, 0, 1, 2, 3, 4, and R cannot mix negative and positive indices in one subset.

-1:4
## [1] -1  0  1  2  3  4
life_exp[-(1:4)]

Also works:

life_exp[-c(1:4)]
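The same behaviour on a short made-up vector (v is a hypothetical example), where the results are easy to check by eye. try() is used only so the failing line does not stop the script.

```r
v <- c(10, 20, 30, 40, 50, 60)   # hypothetical values

-1:4         # the minus applies to 1 first, so this is (-1):4
-(1:4)       # -1 -2 -3 -4
v[-(1:4)]    # drops the first four elements: 50 60
v[-c(1, 3)]  # drops elements 1 and 3: 20 40 50 60

# Mixing negative and positive indices is an error
res <- try(v[-1:4], silent = TRUE)
inherits(res, "try-error")   # TRUE
```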

Data frames

  • Bracket notation: if we use a single bracket with a number, it returns a single column as a data frame
head(gapminder[3])
##        pop
## 1  8425333
## 2  9240934
## 3 10267083
## 4 11537966
## 5 13079460
## 6 14880372
str(gapminder[3])
## 'data.frame':    1704 obs. of  1 variable:
##  $ pop: num  8425333 9240934 10267083 11537966 13079460 ...

However, if we use [[3]], it returns the column as a vector:

head(gapminder[[3]])
## [1]  8425333  9240934 10267083 11537966 13079460 14880372
head(gapminder[["lifeExp"]])
## [1] 28.801 30.332 31.997 34.020 36.088 38.438
str(gapminder[[3]])
##  num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...

A vector.

Think about it via this image:

[Image: peeling an onion]

The $ (dollar sign) operator pulls out a column by name. Names are a lot easier to remember than column numbers.

head(gapminder$year)
## [1] 1952 1957 1962 1967 1972 1977
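A side-by-side sketch of the three ways to pull out a column, on a toy data frame (df, x and y are made-up names for this example):

```r
df <- data.frame(x = c(1.5, 2.5), y = c("a", "b"),
                 stringsAsFactors = FALSE)   # hypothetical toy data

class(df[1])     # "data.frame": single bracket keeps the data frame wrapper
class(df[[1]])   # "numeric": double bracket returns the column itself
class(df$x)      # "numeric": $ also returns the column itself

identical(df[[1]], df$x)     # TRUE
identical(df[["x"]], df$x)   # TRUE
```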

We can pull out rows and columns by giving two arguments inside []:

gapminder[1:3,] #row 1-3 and all columns
##       country year      pop continent lifeExp gdpPercap  countr_fac
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453 Afghanistan
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530 Afghanistan
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007 Afghanistan
gapminder[3,] #row 3 and all columns
##       country year      pop continent lifeExp gdpPercap  countr_fac
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007 Afghanistan
gapminder[3:10, 1:3]
##        country year      pop
## 3  Afghanistan 1962 10267083
## 4  Afghanistan 1967 11537966
## 5  Afghanistan 1972 13079460
## 6  Afghanistan 1977 14880372
## 7  Afghanistan 1982 12881816
## 8  Afghanistan 1987 13867957
## 9  Afghanistan 1992 16317921
## 10 Afghanistan 1997 22227415

Let’s subset gapminder and only include data from 1987:

gapminder[gapminder$year == 1987, ]

How about rows with a population of 15,000,000 or more?

gapminder[gapminder$pop >= 15000000,]
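The two conditions can also be combined with &. A minimal sketch on a made-up data frame (toy and its values are hypothetical, chosen so the result is easy to verify):

```r
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  year    = c(1987, 1987, 1992, 1987),
  pop     = c(2e7, 1e6, 3e7, 1.5e7),
  stringsAsFactors = FALSE
)   # hypothetical toy data

# A logical vector with one TRUE/FALSE per row
keep <- toy$year == 1987 & toy$pop >= 15000000
keep                             # TRUE FALSE FALSE TRUE

toy[keep, ]                      # matching rows, all columns
toy[keep, c("country", "pop")]   # matching rows, selected columns
```

The comma matters: the logical vector goes in the row position, and leaving the column position empty keeps every column.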

Selecting elements of a vector:

  • Selecting the elements of a vector that match any of a list of values is a very common data analysis task.
  • The gapminder data set contains country and continent variables.
  • Suppose we want to pull out information from Southeast Asia.
  • How do we set up an operation that produces a logical vector that is TRUE for all of the countries in Southeast Asia and FALSE otherwise?
  • Let’s walk through this together; follow along:
seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
## extract the `country` column from the data frame
## (it is already a character vector here, so no conversion from factor is needed)
## and keep just the non-repeated elements
countries <- unique(gapminder$country)

One way:

(countries=="Myanmar" | countries=="Thailand" |
 countries=="Cambodia" | countries == "Vietnam" | countries=="Laos") 
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

A more elegant and better way:

countries %in% seAsia
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
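One pitfall worth a quick sketch: comparing against the whole seAsia vector with == does not do the same thing as %in%. The countries vector below is a made-up short list for illustration, not the real gapminder column.

```r
seAsia <- c("Myanmar", "Thailand", "Cambodia", "Vietnam", "Laos")
countries <- c("Cambodia", "Canada", "Myanmar", "Mexico", "Vietnam")  # hypothetical subset

# == compares position by position (1st with 1st, 2nd with 2nd, ...)
countries == seAsia     # all FALSE here, although three names do match

# %in% asks, for each country, whether it appears anywhere in seAsia
countries %in% seAsia   # TRUE FALSE TRUE FALSE TRUE

# Which requested names are missing from the data?
seAsia[!(seAsia %in% countries)]   # "Thailand" "Laos"
```

The last line is a handy check: in the output above only four positions are TRUE although seAsia has five names, so one of the names is not present in the data.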

How to use this with gapminder:

gapminder[gapminder$country %in% seAsia, ]
##       country year      pop continent lifeExp gdpPercap countr_fac
## 217  Cambodia 1952  4693836      Asia  39.417  368.4693   Cambodia
## 218  Cambodia 1957  5322536      Asia  41.366  434.0383   Cambodia
## 219  Cambodia 1962  6083619      Asia  43.415  496.9136   Cambodia
## 220  Cambodia 1967  6960067      Asia  45.415  523.4323   Cambodia
## 221  Cambodia 1972  7450606      Asia  40.317  421.6240   Cambodia
## 222  Cambodia 1977  6978607      Asia  31.220  524.9722   Cambodia
## 223  Cambodia 1982  7272485      Asia  50.957  624.4755   Cambodia
## 224  Cambodia 1987  8371791      Asia  53.914  683.8956   Cambodia
## 225  Cambodia 1992 10150094      Asia  55.803  682.3032   Cambodia
## 226  Cambodia 1997 11782962      Asia  56.534  734.2852   Cambodia
## 227  Cambodia 2002 12926707      Asia  56.752  896.2260   Cambodia
## 228  Cambodia 2007 14131858      Asia  59.723 1713.7787   Cambodia
## 1045  Myanmar 1952 20092996      Asia  36.319  331.0000    Myanmar
## 1046  Myanmar 1957 21731844      Asia  41.905  350.0000    Myanmar
## 1047  Myanmar 1962 23634436      Asia  45.108  388.0000    Myanmar
## 1048  Myanmar 1967 25870271      Asia  49.379  349.0000    Myanmar
## 1049  Myanmar 1972 28466390      Asia  53.070  357.0000    Myanmar
## 1050  Myanmar 1977 31528087      Asia  56.059  371.0000    Myanmar
## 1051  Myanmar 1982 34680442      Asia  58.056  424.0000    Myanmar
## 1052  Myanmar 1987 38028578      Asia  58.339  385.0000    Myanmar
## 1053  Myanmar 1992 40546538      Asia  59.320  347.0000    Myanmar
## 1054  Myanmar 1997 43247867      Asia  60.328  415.0000    Myanmar
## 1055  Myanmar 2002 45598081      Asia  59.908  611.0000    Myanmar
## 1056  Myanmar 2007 47761980      Asia  62.069  944.0000    Myanmar
## 1525 Thailand 1952 21289402      Asia  50.848  757.7974   Thailand
## 1526 Thailand 1957 25041917      Asia  53.630  793.5774   Thailand
## 1527 Thailand 1962 29263397      Asia  56.061 1002.1992   Thailand
## 1528 Thailand 1967 34024249      Asia  58.285 1295.4607   Thailand
## 1529 Thailand 1972 39276153      Asia  60.405 1524.3589   Thailand
## 1530 Thailand 1977 44148285      Asia  62.494 1961.2246   Thailand
## 1531 Thailand 1982 48827160      Asia  64.597 2393.2198   Thailand
## 1532 Thailand 1987 52910342      Asia  66.084 2982.6538   Thailand
## 1533 Thailand 1992 56667095      Asia  67.298 4616.8965   Thailand
## 1534 Thailand 1997 60216677      Asia  67.521 5852.6255   Thailand
## 1535 Thailand 2002 62806748      Asia  68.564 5913.1875   Thailand
## 1536 Thailand 2007 65068149      Asia  70.616 7458.3963   Thailand
## 1645  Vietnam 1952 26246839      Asia  40.412  605.0665    Vietnam
## 1646  Vietnam 1957 28998543      Asia  42.887  676.2854    Vietnam
## 1647  Vietnam 1962 33796140      Asia  45.363  772.0492    Vietnam
## 1648  Vietnam 1967 39463910      Asia  47.838  637.1233    Vietnam
## 1649  Vietnam 1972 44655014      Asia  50.254  699.5016    Vietnam
## 1650  Vietnam 1977 50533506      Asia  55.764  713.5371    Vietnam
## 1651  Vietnam 1982 56142181      Asia  58.816  707.2358    Vietnam
## 1652  Vietnam 1987 62826491      Asia  62.820  820.7994    Vietnam
## 1653  Vietnam 1992 69940728      Asia  67.662  989.0231    Vietnam
## 1654  Vietnam 1997 76048996      Asia  70.672 1385.8968    Vietnam
## 1655  Vietnam 2002 80908147      Asia  73.017 1764.4567    Vietnam
## 1656  Vietnam 2007 85262356      Asia  74.249 2441.5764    Vietnam

Challenges (15 min)

Instructions: We will split you into groups, one room each. Select one person to drive the computer; the others give instructions on how to solve the problems. The driver shares their screen. Switch halfway through for fun.

https://docs.google.com/document/d/1TrX2BVMB0VpMTYvA--nXj6joy8cRDZz5Zy8zhHHu9-I/edit#

GGPLOT2 - 30min or less

If ggplot2 is not installed yet:

install.packages("ggplot2")
library("ggplot2")
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) + scale_x_log10()

ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
  geom_line()

Make the points stand out:

ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
  geom_line(mapping = aes(color=continent)) + geom_point()

ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
  geom_line() + geom_point()

americas <- gapminder[gapminder$continent == "Americas",] #subset
ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap( ~ country) +
  theme(axis.text.x = element_text(angle = 45))

You’ll get hands-on tomorrow with Stephanie. Thanks everyone!